3. Protecting Data in Motion
Any sort of cloud backup solution obviously must transfer bits to the
cloud and back. An attacker shouldn’t be able to peek at your data as it
is flowing through the Internet. Note that, although an attacker can’t
look at your data, there is no easy protection from attackers detecting
that some data transfer is happening in the first place, or guessing at
the amount of data being sent. The size and timing of traffic over any
communication protocol are difficult to keep secret.
Thankfully, you need to do very little to actually protect your data
in motion. The Windows Azure storage APIs have an HTTPS variant that is protected using Secure Sockets Layer
(SSL)/Transport Layer Security (TLS). You can perform the same API
operations, but with an https:// URI instead of an http:// URI. This will
give you a great deal of security with only a small drop in performance.
Most people would stop with this. But as mentioned, paranoia will reign in
this chapter. This means you must ensure that the way you do SSL is
actually secure in itself.
This looks easy on the surface, but it actually requires some work
from the client side. The common mistake developers make is to replace all
http:// URIs with https:// URIs in their code and stop there. Though this
will let the application continue working, it is actually insecure unless
the client also checks the certificate the server presents. To understand
the right way to do this and why just making calls to https:// URIs isn’t
sufficient, let’s first take a quick peek at how SSL works.
SSL (or TLS, to use its more modern name) is a way for
client/server applications to communicate securely over a network without
risk of eavesdropping, tampering, or message forgery. Almost everyone has
probably interacted with an SSL-based website (for example, when shopping
online), and a lot of programmers have probably built web services or
websites secured using SSL.
SSL is built on two core concepts: certificates and certification authorities.
A certificate (an X.509v3 certificate, to be specific) is a wrapper around a
public key, together with details that identify its owner. The matching
private key is installed alongside the certificate on the web server and
is never sent to clients. When the browser contacts the web server, it gets
the certificate with the public key, and uses some standard cryptographic
techniques to set up a secure connection with data that only the server can
decrypt. That is sufficient to ensure that you are securely talking to the
server, but how do you tell that the server itself can be trusted? How do
you know that https://www.paypal.com is actually the company PayPal?
Note: To understand what happens under the covers with certificates, see
a great blog post that goes into excruciating detail at http://www.moserware.com/2009/06/first-few-milliseconds-of-https.html.
This is where a certification authority
(CA) comes in. CAs are a few specialized companies,
trusted by your browser and/or your operating system, whose only purpose
is to verify the authenticity of certificates. They do this through
various offline mechanisms—anything from a phone call to requiring a fax.
Once they’ve verified that the person asking for a certificate to, say,
https://www.paypal.com is actually the company
PayPal and not some scammer, they sign PayPal’s certificate with their
own.
Figure 1 shows PayPal’s certificate. You can see that it has been
authenticated by VeriSign, which is a trusted CA.
Similarly, Figure 2
shows the certificate Microsoft uses for Windows Azure blob storage. In
this figure, you can see the “certification chain.” CAs often sign the certificates of other
CAs, who in turn can validate individual certificates or sign yet another
CA, thereby constructing a chain. In this case, you see at the bottom that
*.blob.core.windows.net is signed by Microsoft Secure
Server Authority. If you follow the chain, you wind up at GTE CyberTrust, which is a well-known CA that is trusted on
most browsers and operating systems and is now owned by Verizon.
When you make a storage request to https://*.blob.core.windows.net, your connection should be
secured by the certificate shown in Figure 2. As you might have
figured by now, the connection is insecure if the server is actually using
some other certificate. But how is that possible? Unfortunately, the
answer is that it is quite easy.
One attack would be to redirect your request to a server that has a
certificate chaining to a valid CA, but not for
blob.core.windows.net. The attacker would use a
man-in-the-middle (MITM) attack where he redirects your
request to a server of his choosing. If your code only asks for an SSL
connection, but doesn’t check that the certificate it is getting matches
the domain/host it is connecting to, it can get fooled by the server
presenting any certificate.
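To make the missing check concrete, here is a minimal sketch of matching a certificate's name against the host you meant to reach. It is a simplification written for illustration (real libraries follow more detailed rules), and both the function name hostname_matches and the account name myaccount are hypothetical.

```python
def hostname_matches(pattern, hostname):
    """Return True if a certificate name pattern covers hostname.

    Simplifying assumption of this sketch: a leading "*." stands for
    exactly one DNS label; everything else is compared case-insensitively.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        # Split off the first label of the hostname and compare the rest
        # with the part of the pattern that follows "*.".
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == pattern[2:]
    return pattern == hostname


if __name__ == "__main__":
    # "myaccount" is a made-up storage account name.
    print(hostname_matches("*.blob.core.windows.net",
                           "myaccount.blob.core.windows.net"))
    print(hostname_matches("*.blob.core.windows.net", "evil.example.com"))
```

A client that skips this comparison will accept a perfectly valid certificate issued for evil.example.com while believing it is talking to blob storage.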
Another attack is equally easy to execute. The attacker generates a
special kind of certificate called a self-signed certificate where, instead of getting
it validated by a CA, it is validated by itself. In the real world, that’s
like asking people for a proof of identity and they hand you a piece of
paper signed by them and certifying them to be who they say they are. Just
like you wouldn’t trust that piece of paper, you shouldn’t trust
self-signed certificates either.
Note: This is not to say that self-signed certificates are insecure.
They are very useful in various other scenarios, and are even used in
other parts of Windows Azure for perfectly valid, secure reasons.
Figure 3 shows a
self-signed certificate generated for
*.blob.core.windows.net. Apart from the fact that it
is “signed” by itself, note how it looks like a legitimate certificate in
every other aspect.
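As a rough illustration of what "signed by itself" means, the sketch below compares the subject and issuer fields of a certificate. It assumes the certificate is represented as a dictionary shaped like the one returned by Python's ssl.SSLSocket.getpeercert(), and the sample data is made up. Matching names are only a hint: a real check verifies the signature, and root CA certificates are legitimately self-signed too.

```python
def looks_self_signed(cert):
    """Heuristic: a certificate whose issuer equals its subject.

    cert is assumed to be a dict shaped like the result of
    ssl.SSLSocket.getpeercert(). This compares names only and does
    not verify any signature.
    """
    subject = cert.get("subject")
    return subject is not None and subject == cert.get("issuer")


# Hypothetical sample data for illustration.
forged = {
    "subject": ((("commonName", "*.blob.core.windows.net"),),),
    "issuer": ((("commonName", "*.blob.core.windows.net"),),),
}
genuine = {
    "subject": ((("commonName", "*.blob.core.windows.net"),),),
    "issuer": ((("commonName", "Microsoft Secure Server Authority"),),),
}

if __name__ == "__main__":
    print(looks_self_signed(forged))
    print(looks_self_signed(genuine))
```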
There is another way to fool SSL clients that is more difficult to
protect against. CAs are not equal, and some are more lax in how they verify
a request than others are. In several cases, attackers have fooled CAs
into issuing certificates for well-known domains. Not only should you
check for the presence of a CA, but you should also check for the presence
of the right CA!
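One common way to check for "the right CA" is to compare a fingerprint of the CA certificate against a value you recorded ahead of time. This is an illustrative alternative, not the mechanism used later in this chapter (which trusts a single CA file instead). The function names are hypothetical, and the bytes used in the demonstration are a placeholder rather than a real certificate.

```python
import hashlib
import hmac


def fingerprint(der_bytes):
    """Return the SHA-256 fingerprint of DER-encoded certificate bytes."""
    return hashlib.sha256(der_bytes).hexdigest()


def is_pinned(der_bytes, expected_fingerprint):
    """Compare a certificate's fingerprint with one recorded in advance."""
    return hmac.compare_digest(fingerprint(der_bytes),
                               expected_fingerprint.lower())


if __name__ == "__main__":
    # Placeholder bytes standing in for a CA certificate.
    placeholder = b"placeholder CA certificate bytes"
    pin = fingerprint(placeholder)
    print(is_pinned(placeholder, pin))
    print(is_pinned(b"some other certificate", pin))
```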
Python up to version 2.5 doesn’t have any good mechanism to do
these checks. Python 2.6 added an SSL module that included some, but not
all, of this functionality. Let’s use the OpenSSL library to do the heavy
lifting in terms of verifying SSL certificates. OpenSSL is a native
library that can’t be accessed directly from Python. The M2Crypto package
provides a nice Pythonic wrapper around OpenSSL.
Note:
It is easy to implement the same using .NET, Java, or pretty much any
modern programming platform. In .NET, look at
System.Net.ServicePointManager’s ServerCertificateValidationCallback.
In Java, look at javax.net.ssl.TrustManager. Both .NET and Java do some
of this validation (checking a certificate’s validity, whether the
certificate’s Common Name matches the hostname) by default, but not all
of it. For example, neither checks by default whether the certificate
chains to the one specific CA you expect.
You can find the entire source code for this in the storage.py file as part of the azbackup source tree. USE_HTTPS is a useful
variable that you’ll use to toggle whether the storage library should use
HTTPS. HTTPS can be quite painful when debugging, and having the ability
to turn it off with a configuration option is useful.
import httplib
import sys
from M2Crypto import httpslib, SSL
USE_HTTPS = False
The first task is to switch over from using an HTTP connection to an
HTTPS connection. Doing so provides an unexpected bonus. M2Crypto (or, to
be specific, OpenSSL) checks whether the hostname matches the
certificate’s Common Name (CN), and takes care of the first of the two
attacks previously described. Example 1
shows how to make _do_store_request use
an SSL connection if required. The actual SSL connection is made using
M2Crypto’s httpslib, which is a
drop-in, interface-compatible SSL version of Python’s httplib.
Example 1. Using an HTTPS connection
# Create a connection object
if USE_HTTPS:
    ctx = SSL.Context()
    # The line below automatically checks whether cert matches host
    connection = httpslib.HTTPSConnection(account_url, ssl_context=ctx)
else:
    connection = httplib.HTTPConnection(account_url)
The next step toward protecting your SSL connection is to ensure
that it comes from the right CA. This is a tricky proposition. There are
several well-known CAs, and depending on your programming platform and
operating system, you might trust any number of them. Windows has a
default list of CAs it trusts (which you can find by running certmgr.msc). OS X has a default list that you
can find in Keychain Access, and Firefox maintains a huge list of CAs that it
trusts. .NET trusts everything that Windows trusts by default. OpenSSL and
M2Crypto trust no CA by default, and expect you to tell them which ones to
trust. All in all, this is quite messy and confusing.
You solve this by using the “let’s be paranoid” principle. Instead
of trusting several CAs, you trust only one CA: the
CA that issues the certificate for
*.blob.core.windows.net, namely, GTE CyberTrust. To make M2Crypto/OpenSSL trust this CA, you
need GTE CyberTrust’s certificate in a specific format: PEM-encoded. (PEM
actually stands for “Privacy-Enhanced Mail,” though its common uses have
nothing to do with “privacy” or “mail,” or any form of “enhancement.” It
is a simple technique for representing arbitrary bytes in ASCII-compatible
text. The best reason to use it is that it is widely supported.) There are
multiple ways to get the certificate in this format.
Note: This paranoid approach means your application could break if
Microsoft appoints someone else to issue its SSL certificate. After all,
nowhere does Microsoft promise that it will stick to the same SSL
provider.
The easiest way to do this is to export the certificate from Firefox
or Internet Explorer. Add that to your source as the file cacerts.pem and place it in the same directory as
the rest of the source code.
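If you are curious what a PEM file such as cacerts.pem contains, the sketch below shows the encoding itself: the certificate's binary (DER) bytes are Base64-encoded, wrapped at 64 characters, and framed by BEGIN/END marker lines. The helper names are hypothetical, and the bytes used in the demonstration are arbitrary sample data rather than a real certificate.

```python
import base64

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def der_to_pem(der_bytes):
    """Wrap binary certificate bytes in PEM text."""
    body = base64.b64encode(der_bytes).decode("ascii")
    # PEM bodies are conventionally wrapped at 64 characters per line.
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join([PEM_HEADER] + lines + [PEM_FOOTER]) + "\n"


def pem_to_der(pem_text):
    """Recover the binary bytes from a single PEM certificate block."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    if len(lines) < 2 or lines[0] != PEM_HEADER or lines[-1] != PEM_FOOTER:
        raise ValueError("not a PEM certificate block")
    return base64.b64decode("".join(lines[1:-1]))


if __name__ == "__main__":
    sample = bytes(range(100))  # arbitrary bytes, not a real certificate
    print(der_to_pem(sample))
```

Python's standard ssl module offers DER_cert_to_PEM_cert and PEM_cert_to_DER_cert for the same conversion, so you would not normally write this yourself.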
You can now check whether the certificate for the HTTPS connection
is signed by a chain that ends in “GTE CyberTrust.” Example 2 shows the code to do that.
This is a modification of _do_store_request and builds on the code shown
in the previous example. Example 2 uses the SSL.Context class from M2Crypto, and sets it to
verify that the endpoint you are connecting to has a certificate and that
it is a certificate you trust. The code then loads the list of trusted root
certificates from a file called cacerts.pem, in which you place GTE CyberTrust’s
certificate.
Example 2. HTTPS checking for a specific CA
# Create a connection object
if USE_HTTPS:
    ctx = SSL.Context()

    # Verify that the server chains to a known CA.
    # We hardcode cacerts.pem in the source directory
    # with GTE CyberTrust's certs which is what
    # Windows Azure chains to currently.
    ctx.set_verify(SSL.verify_peer | SSL.verify_fail_if_no_peer_cert, 9)

    # GTE's certs are kept in cacerts.pem in the same directory
    # as source. sys.path[0] always
    # contains the directory in which the source file exists
    if ctx.load_verify_locations(sys.path[0] + "/cacerts.pem") != 1:
        raise Exception("No CA certs")

    # The line below automatically checks whether cert matches host
    connection = httpslib.HTTPSConnection(account_url, ssl_context=ctx)
else:
    connection = httplib.HTTPConnection(account_url)

# Perform the request and read the response
connection.request(http_method, path, data, headers)
response = connection.getresponse()
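For comparison, the same idea can be sketched with the ssl and http.client modules from the Python 3 standard library. This is not the azbackup code, which uses M2Crypto on Python 2. The helper names are hypothetical, and the sketch assumes that cacerts.pem holds only the one CA certificate you want to trust.

```python
import http.client
import ssl


def build_pinned_context(ca_file):
    """Create a client context that trusts only the CAs in ca_file."""
    # PROTOCOL_TLS_CLIENT requires a certificate and checks the hostname
    # by default, and it starts with an empty set of trusted CAs.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Raises an error if the file is missing or unreadable.
    ctx.load_verify_locations(cafile=ca_file)
    return ctx


def open_connection(host, use_https=True, ca_file="cacerts.pem"):
    """Return an HTTP(S) connection object; no network traffic happens yet."""
    if use_https:
        ctx = build_pinned_context(ca_file)
        return http.client.HTTPSConnection(host, context=ctx)
    return http.client.HTTPConnection(host)
```

With a context built this way, the TLS handshake fails if the server's certificate does not chain to a CA in the file or does not match the hostname, so the request is never sent over an unverified connection.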
You now have some secure SSL code that you can rely on to only talk
over an encrypted channel to the Windows Azure blob storage service. How
much performance overhead does it add? Surprisingly little. During
performance tests, it took several thousand iterations before any
perceptible performance difference from using SSL could be measured. Your
results may vary, and you should always test before making a change like
this.
Note that you can go further if you want. There are always more
things to check and enforce—such as certificate validity, the cipher suite
picked by the SSL connection, dealing with proxy servers along the way,
and so on.
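As a rough sketch of what going further might look like, the following uses Python 3's standard ssl module (again, not the M2Crypto code in azbackup) to restrict the protocol version and cipher suites, and to check a certificate's validity window explicitly. The function names, the choice of TLS 1.2 as the minimum, and the cipher string are all assumptions made for illustration.

```python
import ssl


def harden(ctx):
    """Tighten an existing ssl.SSLContext (the choices here are illustrative)."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Restrict TLS 1.2 to forward-secret AES-GCM suites.
    ctx.set_ciphers("ECDHE+AESGCM")
    return ctx


def is_within_validity(cert, now):
    """Check notBefore <= now <= notAfter.

    cert is assumed to be a dict shaped like the result of
    ssl.SSLSocket.getpeercert(); now is seconds since the epoch.
    """
    not_before = ssl.cert_time_to_seconds(cert["notBefore"])
    not_after = ssl.cert_time_to_seconds(cert["notAfter"])
    return not_before <= now <= not_after
```

Certificate verification already rejects expired certificates during the handshake, so the explicit validity check is mainly useful for logging or for warning before a certificate expires.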